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Abstract 

Polyadenylation is a cotranscriptional nuclear RNA processing event involving endonucleolytic cleavage of the nascent, 
emerging pre-messenger RNA (pre-mRNA) from the RNA polymerase, immediately followed by the polymerization of 
adenine ribonucleotides, called the poly(A) tail, to the cleaved 3' end of the polyadenylation site (PAS). This apparently simple 
molecular processing step has been discovered to be connected to transcription and splicing therefore increasing its potential 
for regulation of gene expression. Here, through a bioinformatic analysis of c/s-PAS-regulatory elements in mammals that 
includes taking advantage of multiple evolutionary time scales, we find unexpected selection pressure much further 
upstream, up to 200 nt, from the PAS than previously thought. Strikingly, close to 3,000 long (30-500 nt) noncoding 
conserved fragments (CPs) were discovered in the PAS flanking region of three remotely related mammalian species, human, 
mouse, and cow. When an even more remote transitional mammal, platypus, was included, still over a thousand CPs were 
found in the proximity of the PAS. Even though the biological function of these CPs remains unknown, their considerable 
sizes makes them unlikely to serve as protein recognition sites, which are typically <15 nt. By harnessing genome wide 
DNasel hypersensitivity data, we have discovered that the presence of CPs correlates with chromatin accessibility. Our study is 
important in highlighting novel experimental targets, which may provide new understanding about the regulatory aspects of 
polyadenylation. 
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Introduction 

The majority of vertebrate protein-coding messenger RNA 
precursors (pre-mRNAs) undergo required posttranscrip- 
tional modifications namely 5' capping, splicing, and 
polyadenylation, in the nucleus before being exported to 
the cytoplasm. Collectively, these are often called posttran- 
scriptional processing events even though these three 
processes are actually orchestrated cooperatively during 
transcription. RNA processing serves vital biological func- 
tions and is thought to facilitate diversity. Intriguingly, poly- 
adenylation is the only pre-mRNA modification out of the 
three that is preserved in all domains of life (Sarkar 1997; 
Portnoy and Schuster 2006), that is, prokaryotes, archaea, 
and eukaryotes. Despite its ancient origin, polyadenylation 
has become more sophisticated during the course of evolu- 
tion (ca. 3 billion years). Here, we use the more complex 



mammalian species as a model to investigate the possible 
regulatory elements of polyadenylation. 

All eukaryotic protein-coding mRNAs are polyadenylated 
except histones. Polyadenylation consists of two sequential 
enzymatic reactions, that is, the endonucleolytic cleavage 
of nascent pre-mRNA emerging from the transcription 
complex, immediately followed by the polymerization 
of adenosine nucleotides to the cleaved 3' end of the pre- 
mRNA molecule. The location of the endonucleolytic cleav- 
agesite, namelythepolyadenylationsite(PAS), isspecificde- 
spite the fact that —54% of human and —32% of mouse 
genes are found to possess more than one PAS (Tian et al. 
2005). The polyadenosine nucleotides polymerized at the 
3 ' end of the mRNA is commonly known as the poly(A) tail. 
The typical length of the poly(A) tail in mammals is 200-250 
nt long, whereas lower organisms tend to have a shorter pol- 
y(A) tail, for example, about 70 nt in yeast, 10-20 nt 



© The Author(s) 2011 . Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. 

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/ 
2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited. 



654 Genome Biol. Evol. 3:654-666. doi:10.1093/gbe/evr053 Advance Access publication June 24, 201 1 



Long Conserved Fragments Upstream of Mammalian Polyadenylation Sites 



GBE 



\n Escherichia coli {Kamiketal. 1987;Taljanidiszetal. 1987). 
Polyadenylation is a nontennplate driven process, in contrast 
to transcription and DNA replication. It takes place in 
the nucleus, however, not without exception as poly(A) tail 
lengthening and shortening is known to occur in the cyto- 
plasnn as best evidenced by exannples fronn Xenopus oocyte 
nnaturation and early embryogenesis (Pique et al. 2008). 

It is well known that vertebrate PAS activation requires 
a large protein complex and two distinct sequence ele- 
nnents, the first being a highly conserved hexanucleotide, 
called the poly(A) signal located 10-30 nt upstreann of 
the PAS. The two nnost prevalent forms of poly(A) signal 
in vertebrates are AAUAAA and AUUAAA, collectively called 
the canonical poly(A) signal. According to our unpublished 
data and that of others (Beaudoing et al. 2000; Tian et al. 
2005), AAUAAA and AUUAAA are found in approximately 
66% and 16% of mammalian genes, respectively. The sec- 
ond sequence element, called the downstream sequence 
element (DSE), begins —15 nt downstream of the PAS. 
Unlike the poly(A) signal, it has a quite degenerate consen- 
sus sequence enriched in uracil (U) and guanine (G) but not 
simple (GU)n repeats as reported previously using SELEX and 
NMR studies (Takagaki and Manley 1997; Perez Canadillas 
and Varani 2003; Salisbury et al. 2006). Due to its nucleotide 
bias, this downstream polyadenylation element is often 
named the U/GU-rich region. In addition, experimental data 
indicated that cleavage and polyadenylation occur deter- 
ministically at a fixed location (±10 nt) between the poly(A) 
signal and the U/GU-rich region. A recent computational 
study of PAS downstream sequences from various metazo- 
ans suggested that DSEs exhibit a 5' to 3' transition from 
UG-rich to U-rich (Salisbury et al. 2006), an observation 
consistent with our recent work (Ho et al. 2009). 

Both upstream and downstream c/s-polyadenylation 
elements have been studied experimentally and bioinfor- 
matically. Bioinformatic analysis discovered the enrichment 
of certain hexamers upstream, up to 100 nt, in human 
(Hu et al. 2005) or downstream, up to 60 nt, of PASs in 
metazons (Salisbury et al. 2006). Through experimental 
studies, various functions have been attributed to other 
c/s-regulatory elements including but not limited to, the 
inhibition of polyadenylation through a U-rich region down- 
stream of the PAS (Zhu et al. 2007), stabilization of the poly- 
adenylation complex by U-rich elements upstream of the 
PAS (Danckwardt et al. 2004, 2006, 2007; Kaufmann 
et al. 2004), alteration of polyadenylation by U/GU-rich 
elements downstream of the PAS (Liu et al. 2008), alteration 
of the cleavage step through proximal and distal G-rich 
elements downstream of the PAS (Phillips et al. 2004; Dalziel 
et al. 2007), and U1 A autoregulation through U1 A binding 
a polyadenylation inhibition element (PIE) (Boelens et al. 
1993; Gunderson et al. 1994, 1997). So far, these studies 
emphasized the presence of short (<15 nt) c/s-regulatory 
elements flanking up to 1 00 nt upstream the PASs. Further- 



more, other related studies found highly conserved regions 
(HCRs) in noncoding sequences (Duret et al. 1993; Duret 
and Bucher 1997), but no preferred locations were 
reported. Ten of the HCRs have been tested in a viral-based 
reporter gene assay (Spicher et al. 1998). Five HCRs were 
shown to affect mRNA stability and two affected translation 
efficiency. All these studies have largely ignored the possi- 
bility that highly conserved elements could be effecting 
3' end processing (Siepel et al. 2005). 

Here, we undertake a frans-mammalian (human, mouse, 
cow, platypus) analysis of sequences flanking the PAS and 
report the following new findings: 1 ) there is selection pres- 
sure not only for the highly conserved and short poly(A) sig- 
nal but also in the farther upstream region (up to 200 nt) of 
the PAS, 2) there is a prevalence of long (>30 nt up to 500 
nt) conserved fragments (CFs) up to 200 nt upstream of the 
PAS in distant mammalian species, and 3) there is evidence 
for a role of CFs in chromatin architecture. 

Materials and Methods 

EST-Based PASs 

Human and mouse expressed sequence tags (ESTs) obtained 
from the NCBI Refseq database (ftp://ftp.ncbi.nih.gov/ 
refseq/release/) were used to compile a set of reliable PASs. 
Only polyadenylated ESTs, that is, either beginning with six 
or more T's or ending with six or more A's, were selected. 
Those polyadenylated ESTs were then mapped to the respec- 
tive genome in order to make sure that the T/A-tracks of the 
ESTs did not originate from the genome and to determine 
the direction of transcription. By using this method, 17,090 
and 8,779 human and mouse PASs were collected, respec- 
tively. Detailed procedures can be found in Supplement B 
(Supplementary Material online). 

Selection Pressure in PAS Flanking Regions 

We used the substitution rate of the orthologous PAS flank- 
ing regions among different organisms to measure the 
degree of selection pressure. Unfortunately, noncoding 
regions such as the 3' untranslated regions (UTRs), which 
embody the PASs, are generally not conserved among 
remote species, making sequence alignment unfeasible. 
Furthermore, nucleotide sequence comparison often suffers 
from the homoplasy effect, that is, a given recent substitu- 
tion is reverted to its ancestral form over a long evolutionary 
time unless the interested region is subjected to high selec- 
tion pressure. To overcome this issue, the approach 
to harness highly similar genomes between close species 
genomes was adopted to examine the existence of selection 
pressure flanking the PAS. Three pairs of close species 
were used: namely human-chimpanzee, human-rhesus 
(rhesus macaque), and mouse-rat. The percentages of 
genome identity between human-chimpanzee and 
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Fig. 1. — Mismatch ratio. Green lines on the left illustrate 600-nt 
real PAS flanking regions. Gray lines on the right represent control 
sequences. Cross symbols represent a mismatch detected between 
a close species pair. The mismatch ratio is computed for each position, 
denoted by /, across all sequences. 



human-rhesus are 99% (Chimpanzee Sequencing and 
Analysis Consortium 2005) and 94% (Rhesus Macaque Ge- 
nome Sequencing Consortium 2007), respectively. For the 
mouse-rat pair, only 8-10% of substitutions were accumu- 
lated because they diverged -12 to 24 Ma (Rat Genome 
Sequencing Project Consortium 2004). 

If a given genomic region is subjected to neutral selection, 
one would expect random substitutions to be distributed 
evenly along that region; otherwise, they are either localized 
or depleted in that region. Based upon this intuition, the fol- 
lowing procedure was devised to examine the extent of 
selection pressure flanking the PAS. 

1. Obtain 17,080 human and 8,799 mouse PASs from 
our EST-based PAS database (described in the 
Supplement B, Supplementary Material online). 

2. Consider regions [-300, +300] (hereafter, in con- 
vention [-M, +N], M denotes nucleotides upstream 
and N denotes nucleotides downstream of the PAS). 

3. Identify homologous PASs in close species pairs 
human-chimpanzee, human-rhesus, and mouse- 
rat using NCBI-BlastN (Camacho et al. 2009). 

4. Remove sequences from genes with 3' UTRs < 500 
nt or that contain coding regions as found in some 
genes with multiple PASs. 

5. Compile a control data set that is of the same length 
and number as the PAS sequences from step 1 . Three 
control sequences were prepared from random 
locations in the intergenic region, open reading 
frame (ORF) (spliced), and intronic region with 
canonical poly(A) signals. 

6. Examine the mismatch ratio (explained below and in 
fig. 1) for each position among homologous pairs in 
[-300, +300] of the PAS. 

Using the above procedure, 16,835, 16,759, and 8,604 
pairs of homologous PASs were found between human- 
chimpanzee, human-rhesus, and mouse-rat, respectively. 
For both real and control result sets, the number of mis- 
matches was counted between each pair of close species 
for each position along the [-300, +300] region. Then, 
the two mismatch counts were combined into a ratio per 



position as shown in figure 1. (Note: the mismatch ratio 
was set to undefined during plotting if the number of 
mismatches in control sequences was zero. Because large 
numbers of PAS regions were considered, this situation 
was found to occur only in the first and last three positions 
at either end, thereby not affecting the overall analysis.) The 
mismatch ratio reflects the comparative substitution rate in 
PAS flanking regions versus control sequences. A value close 
to 1, >1, and <1 indicates neutral, higher, and lower 
substitution rates, respectively, in the PAS flanking regions 
as compared with the control. The choice of control sequen- 
ces was based on the assumption that intergenic sequence is 
subjected to the least selection pressure, whereas the stron- 
gest pressure is in the ORF. The comparison of the PAS flank- 
ing region with these two extremes enables us to 
understand the magnitude of selection pressure. Besides 
the PAS flanking region, other types of genomic sequences, 
such as 5' splice sites, parts of the 3' UTR far from the PAS 
and introns, were included in this study in order to confirm 
the validity of this method. The degree and the extent of 
conservation of the region flanking the PAS were examined 
by plotting the mismatch ratio for these two pairs of close 
species (fig. 1). 

Conserved Fragments 

Four evolutionarily remote mammalian species were chosen 
in this analysis namely human, mouse, cow, and platypus. 
Gene homologous information (based on proteins) of 
human, mouse, and cow was obtained from the NCBI 
HomoloGene database (HomoloGene 2009). As the ge- 
nome of platypus was completed only recently, little expres- 
sion data is available to obtain its homologous information 
with other species. To circumvent this, human PAS flanking 
sequences were used to search against the platypus genome 
in order to identify homologous regions in platypus. Because 
two different ways were used to obtain the homologous 
information, the four mammalian species were divided into 
two homologous groups, namely HMC, which was com- 
posed of human, mouse, and cow and HMCP, which 
contained all four species. 

To explore the conservation of the region that spans 
[-500, +500] while avoiding the influence of the ORF, 
genes possessing 3' UTRs shorter than 500 nt or regions 
overlapping with the ORF were dropped from the analysis. 
Low complexity and repeat fragments were masked off from 
the sequences using RepeatMasker (Smit et al.). The multiple 
sequence alignment tool T-COFFEE (Notredame et al. 2000) 
was then used to align the PAS flanking regions for each or- 
thologous group. A score value, in the range of 0-100, was 
returned from each alignment process, where 0 and 1 00 rep- 
resents no and perfect alignment, respectively. Based on the 
alignment report, CFs were extracted from each orthologous 
gene group and duplicated fragments were eliminated for 
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those genes having nnultiple closely spaced PASs. Altogether, 
1 0,765 and 5,362 orthologous genes from groups HMC and 
HMCP were aligned, respectively. A 15-nt sliding window 
was used to scan the alignnnent, base by base. For a given 
gene, a "good" alignment was defined to be <3 mismatches 
(80% identity) and overlapping good windows were then 
stitched together to form the final CF for that gene. 

For control purposes, we selected two other genomic re- 
gions as controls, one was the 500-nt region downstream of 
PAS and the other was the 500-nt region at the 5' most of 
the 3' UTR given its length was at least 1,000 nt. The min- 
imum length requirement ensured the CFs, which were 
found in the control region, did not overlap with those in 
the flanking region of the PAS. We named the former con- 
trol the downstream control and the latter the 5' control. 
The same CF searching procedure was used to identify 
CFs in the two control regions in the HMC group. 

Chromatin Structure of PAS 

We investigated the openness of chromatin in PAS flanking 
regions through DNasel hypersensitivity (HS) studies in four 
different human cell lines, one normal (H1-hESC) and three 
transformed (K562, HeLa S3, and GM12878). No equivalent 
mouse data were used in our analysis as there has been in- 
sufficient analysis of this type using mouse cell lines. All data 
were downloaded from the ENCODE project (Rosenbloom 
et al. 201 0) hosted in the UCSC Genome Browser (Karolchik 
et al. 2003) website (http://genome.ucsc.edu/ENCODE/). 
Two independent HS data sets were obtained, one 
was produced by ENCODE Open Chromatin Map from 
the Crawford/Duke, Leib/UNC, and Lyer/UT-Austin labs 
(Crawford, Davis, et al. 2006; Crawford, Holt, et al. 2006; 
Boyle et al. 2008) and the other from the University of Wash- 
ington DNasel HS by Digital DNasel (Sabo et al. 2004). 

In all data sets mentioned above, high-resolution raw se- 
quence data were mapped to 16,730 human PAS flanking 
sequences having no overlap with the ORF along regions 
[-500, +50] for each cell line. As a result, each sequence 
is associated with the number of DNasel cut sites. However, 
only nonzero data were included in our analysis as zero data 
have an ambiguous interpretation of being either nonde- 
tectable or insensitive to DNasel. We split these mapped re- 
sults into three groups according to the size of the CF 
namely 11,669 without CF, 4,522 with short CF (>30 
and <200 nt), and 534 with long CF (>200 nt). 

In order to evaluate any statistically significant differences 
among these three groups in terms of HS mapping, we em- 
ployed the resampling technique followed by a two-sample 
Mest. In each run, we performed 2,000 rounds of sampling. 
The average DNasel cut sites in each round was computed 
by sampling equal numbers of PASs from the two compared 
groups; the sample size was set to 50% of the smaller 
group. The distributions of these 2,000 pairs of sample 
averages from the two compared groups were confirmed 



to exhibit normality using Q-Q normal plot (through 
qqnorm function in R) that justified the use of Mest for 
our purpose. 

Results 

Selection Pressure on the Farther Region 200 nt 
Upstream of the PAS 

The mismatch ratios between flanking PASs for human- 
rhesus and mouse-rat pairs are illustrated in figure 2. 
Because human and chimpanzee diverged more recently 
at only ~6 Ma as compared with —12 to 24 and 25 Ma 
for mouse-rat and human-rhesus, respectively (Rat 
Genome Sequencing Project Consortium 2004; Rhesus 
Macaque Genome Sequencing Consortium 2007), the plot 
of mismatch ratios for the human-chimpanzee pairs 
experiences substantial random fluctuations, therefore in 
the main text, we will present data from human-rhesus 
and mouse-rat pairs only, whereas the mismatch ratio plots 
for human-chimpanzee pair can be found in Supplement C 
(Supplementary Material online). 

In figure 2, the blue line represents the mismatch ratio 
between the real PAS and the intergenic control sequence, 
likewise for the green line except that the control is changed 
to the ORF. The gray line represents the comparison be- 
tween the two types of control sequences, that is, ORF/ 
3' UTR versus intergenic. 

For the human-rhesus pair (fig. 2A), the mismatch ratio of 
real PAS sequence versus intergenic sequence (blue line) is < 1 
for the entire region indicating a stronger selection pressure 
in the PAS sequences than in the intergenic sequences. 
However, the experienced selection pressure is weaker than 
the pressure to preserve the ORF (green line) except for the 
region —30 nt upstream of the PAS, which is the preferred 
location of the poly(A) signal. A similar pattern is observed 
between the mouse-rat comparison as shown in figure 
2B. In addition, the region upstream of the poly(A) signal 
not only experienced a stronger selection pressure than 
the region downstream but also covered a wider region as 
indicated by the drastic drop of the mismatch ratio after 
— 50 nt from the PAS as shown in figure 2A and B. This asym- 
metrical pressure is not caused by any possible uneven 
selection pressure in the two types of control sequences along 
the considered region because the mismatch ratio line (gray 
line) for ORF versus intergenic stays at a steady level (-0.5) 
across the entire region. 

In order to determine the range of the selection pressure 
on the upstream region starting from the poly(A) signal, the 
first 600 nt of 3' UTR, equivalent to the 5' part of the 3' 
UTR, was chosen as a control rather than the ORF as the 
PAS upstream flanking region is actually part of the 3' 
UTR. Differences in selection pressure between the 5' part 
of the 3' UTR and the 3' UTR near the PAS can thus be 
ascribed to the PAS. As shown in figure 2C and D, when 
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Fig. 2. — Selection pressure by close species comparison. Shown is the mismatch ratio in the PAS flanking region between close species. {A and B) 
The mismatch ratio variation for region [-250, +200] and (Cand D) the PAS flanking region versus 3' UTR. 



the 5' part of the 3' UTR is taken as the control, the nnis- 
match ratio (green) line asymptotically approaches 1 in the 
upstream direction and becomes flat by -200 nt upstream 
of the PAS. The horizontal mismatch ratio plot (gray line in 
fig. 2Cand D) between the 3' UTR and the intergenic region 
is similar to that of the ORF versus intergenic in figure 2A and 
B indicating the 3' UTR does not exhibit uneven selection 
pressure across the considered region. The data also indicate 
3 ' UTRs experience a lower substitution rate than intergenic 
sequences, which is in agreement with prior studies that 
many expression-related regulatory elements are located 
in the 3' UTR (Xie et al. 2005) but with less clear positional 
preference. 

Although the above close species analysis supports the 
existence of selection pressure flanking the PAS, it is prudent 
to do several types of control analysis to rule out alternative 
explanations such as artifacts inherent in the computation 



methods and alternative biological mechanisms. One possi- 
ble artifact is the NCBI-Blast algorithm favors alignment of 
sequences in the middle of an alignment over sequences 
near the two ends. To refute this possibility, we repeated 
the same analysis as in figure 2A and B except the region 
of interest was shifted upstream or downstream by 200 
nt. The pattern in these plots remains largely unchanged 
except it is shifted to the left or right (supplementary figs. 
E1-E4, Supplementary Material online). Hence, alignment 
bias can be ruled out in this study. To examine whether 
the selection pressure pattern depends on proximal repeats 
of PAS, only the single PAS genes were selected to produce 
the plot. We have found that the same pattern persists 
in both close species pairs (supplementary figs. E5 and 
E6, Supplementary Material online). 

Another possible reason for the selection pressure pattern 
may be due to the presence of the highly conserved poly(A) 
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signal AWUAAA (W = A or U). To exannine this, a set of 600-nt 
long intronic sequences (17,080 from human, 8,799 from 
mouse) with AWUAAA positioned —270 nt from the 5' end 
was randomly sampled. We dub this the pseudo-PAS sequence 
set and more details on its assembly can be found in 
Supplement D (Supplementary Material online). Analysis of 
this sequence data set can be found in supplementary figs. 
E7 and E8 (Supplementary Material online). Results clearly 
showed that these sequences had no selection pressure 
pattern as the mismatch ratio is close to 1 when compared with 
intergenic sequences. Thus, the highly conserved hexanucleo- 
tide by itself failed to reproduce the same asymmetrical pattern 
exhibited by the real PAS flanking region. Moreover, if the dis- 
tinct mismatch ratio pattern were simply caused by the highly 
conserved poly(A) signal, then, figure 2A-D should show 
a symmetric pattern too, however, nothing of that is observed 

The same analysis was also applied to the 5' splice site 
(5'ss) region found in the first exon as it is well documented 
that 5'ss recognition is facilitated by the presence of 
short sequence elements located immediately upstream 
of the 5'ss (Fairbrother et al. 2002; Wang et al. 2004). These 
sequence elements, commonly known as exonic splicing 
enhancers, are targets of serine-rich proteins (SR proteins) 
(Graveley 2000). Because 5'ss splicing enhancers are essen- 
tial for pre-mRNA processing, they must be subjected to 
positive selection pressure. The mismatch ratio has the 
lowest value just upstream of the 5'ss and then rises 
abruptly immediately after the exon-intron junction in 
the 5' to 3' direction. Plots can be found in supplementary 
figs. E9 and E10, Supplementary Material online. 

Finally, 30% and 38% of human and mouse genes were 
found either to overlap or be close (< 1 ,000 nt separation) to 
a neighboring gene. To examine whether such close prox- 
imity or overlap with a gene influences this analysis, such 
genes were removed from the initial data set leaving 
12,195 and 5,553 pairs of human-rhesus and mouse-rat 
homologous PAS regions (supplementary figs. Ell and 
El 2, Supplementary Material online). There is no observable 
difference in the variation of mismatch ratio with respect to 
the unfiltered sequences (fig. 2A and B). Thus this battery of 
analysis has confirmed the presence of selection pressure on 
sequences within 0-200 nt upstream of the PAS. 

Percentage of Alignment of PAS Flanking Regions 
Among Remote Mammals 

The close species comparison presented above revealed the 
presence of selection pressure 200 nt upstream of the PAS, 
supporting the existence of other nonrepetitive c/s-elements. 
Although previous attempts in identifying c/s-acting PAS 
elements were successful in capturing the enrichment of 
short and fixed-size sequence motifs, such attempts largely 
neglected the huntforevolutionarily conserved gene-specific 
elements. In order to identify the sequences preserved by this 
selection pressure, we switched the evolutionary time scale 



from close to distant mammalian species for this task. Four 
mammalian species were selected namely human, mouse, 
cow, and platypus. 

The multiple alignment program T-COFFEE was used to 
align 10,765 and 5,362 orthologous gene groups in HMC 
and HMCP, respectively. The relationship between the frac- 
tion of alignment by position was plotted separately by 
alignment score as shown in figure 3. Two alignment score 
thresholds were used namely 50 and 70. Empirically, an 
alignment score above 50 generally indicates the presence 
of long fragments (>30 nt). Note that higher alignment 
scores are often associated with longer and/or multiple CFs. 

Red and blue lines denote high and low scoring groups, 
respectively. Each line represents the variation in fraction of 
genes containing the same nucleotide as human along the 
flanking region of PAS. In total, 5,261 of 10,765 genes or 
49% were found to achieve higher than 50 alignment score 
in the HMC group (fig. 3A). In the HMCP group, 2,668 of 
5,362 genes or 50%, similar to the HMC group were found 
to exceed a 50 alignment score. When a more stringent 
threshold, 70, was adopted, the number of genes dropped 
to 2,160 (20%) for the HMC group and the HMCP group 
dropped even more to 629 genes (12%). But raising 
the threshold resulted in a higher fraction of alignment 
(compare fig. 3A and C or between B and D). 

Not surprisingly, for both high and low scoring groups, 
the best alignment was attained at around 21 nt upstream 
from the PAS, which is the preferred location of the poly(A) 
signal. The peak occurred at 31 nt instead of 21 nt upstream 
in the HMCP group with threshold 70 (fig. 3D), the fractions 
of alignment between them differ by 3 percentage points 
only. The trend of the plot resembles that of the close species 
comparison method where selection pressure is asymmetri- 
cal, that is, higher in strength and range in the upstream 
than the downstream region. However, the degree of align- 
ment seems to extend farther than 200 nt upstream for 
a subset of high scoring genes as revealed in figure 3C 
and D. In total, 1,080 of 2,160 orthologous HMC-group 
genes show a high degree of alignment but not necessarily 
in one continuous stretch, for up to 400 nt upstream. This 
observation provides an intriguing opportunity to look into 
the conservation of the noncoding sequence of each gene. 

Identification of CFs 

The two independent methods, close and remote species 
comparisons, presented here suggest the conservation 
pressure is prominent upstream rather than downstream 
of the PAS, thus the rest of the analysis will concentrate on 
the upstream region only. Based on the multiple alignment 
results, CFs were extracted from genes with alignment 
scores > 50, longer than 30 nt, and limited to one fragment 
per nonoverlapping gene. Altogether, 2,987 and 1,130 non- 
redundantconservedupstreamfragmentswerediscoveredin 
the HMC and HMCP groups, respectively. In the two control 
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Fig. 3. — Fraction of alignment along the PAS flanking positions. Red and blue lines denote high and low scoring groups, respectively. (A) HMC 
group with threshold 50, (B) HMCP with threshold 50, (0 HMC with threshold 70, and (D) HMCP with threshold 70. 



groups, the threshold was set to 50. CFs were found five and 
two times more in the upstream PAS flanking region than in 
the downstream, and the 5' control regions, respectively. 
(More details can be found in Supplement F, Supplementary 
Material online.) The distribution of their lengths is shown in 
figure 4 where almost two-thirds of the CFs were between 30 
and 1 00 nt long in the HMC group. Several CFs were found to 
be as long as 400-500 nt (fig. 4/\ and B). As expected, smaller 
numbers of CFs were found in the HMCP group, however, 
both groups exhibit similar distribution (fig. 4A and B). 

Distance of CF From PAS 

To explore the CF to PAS distance (based on 3' end of CF), 
the relationship between fragment length and proximity to 
the PAS was examined. Figure 5 displays the distribution of 
the distance of these human CFs from the PAS in both the 
HMC and the HMCP groups. Almost half of the CFs were 



found to reside within 20 nt from the PAS in the HMC group 
(fig. 5/\), and the remaining CFs were uniformly distributed 
along the upstream region, suggesting there is no particular 
relation between the size of the CF and proximity to the PAS. 
A consistent picture is found in both the HMC and the 
HMCP groups (fig. 50- Furthermore, the length of CFs that 
were found within 20 nt from the PAS were analyzed as 
shown in figure 5^ and D. Their distribution closely resem- 
bles the overall distribution of CFs where the majority of 
them were between 30 and 100 nt long. 

Examples of CFs 

A sample of alignments and CFs for three genes will be illus- 
trated namely polypyrimidine tract binding protein 2 (PTBP2), 
FBJ murine osteosarcoma viral oncogene homolog (FOS), 
and oligodendrocyte transcription factor 1 (0LIG1). These three 
genes manifest different degrees of conservation near the PAS. 
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All alignnnents can be found in Supplennent G (Supplementary 
Material online). PTBP2 and FOS are extrenne exannples as they 
contain 400 to nearly 500-nt long CFs starting fronn the PAS in 
the 5' direction. PTBP2 is reported to control the assennbly of 
other splicing regulatory proteins and binds to intronic polypyr- 
innidinetractsduring splicing. PTBP2 issinnilartoPTBPI exceptfor 
the fact that it is abundant nnainly in brain. It is evident there is 
a continuous stretch of CFs annong hunnan, nnouse, and cow 
including the poly(A) signal with the CFs being rich in A and T 
but not of low connplexity as repeated and low connplexity 
regions were removed before alignment. The degree of conser- 
vation is amazing, as it is even higher than the coding sequence. 

Another example is FOS, which is a well-studied onco- 
gene that regulates cell proliferation, differentiation, and 
transformation. The total conserved region of FOS, exclud- 
ing the repeat masked fragment, is about 400 nt. 

Not all CFs discovered here include the poly(A) signal 
like PTBP2 and FOS. For instance, a 34-nt long CF was found 
to locate ~ 1 00 nt upstream from the PAS. 0LIG1 is a transcrip- 
tion factor in oligodendrocyte development (Lu et al. 2001 ) that 
plays a role in remyelination after injury (Labombarda et al. 
2009). The small sample of genes discussed here is very limited, 
suggestive of a regulatory function yet to be discovered. 
Especially the region of conservation of OLIG 1 between human 
and mouse expands significantly A full list of alignments of the 
upstream region among the four mammalian species can be 
found in Supplement H (Supplementary Material online). 

CFs Are Gene Specific 

To examine whether our collection of CFs share sequence 
similarity, an exhaustive pairwise comparison was performed 
among CFs in ordertoclusterthemintogroups by subsequence 
similarity However, nosignificantsimilarity was found among 
them, which confirmed one previous study (Spicher et al. 



1998), except for three pairs of genes namely M0RF4L1/ 
MORF4L2, RPL27AP6/RPL27A, and TUBA3C/rUBA4A. Each 
pair shares about a 1 00 nt long highly similar fragment. For 
these pairs, their similarity is more likely due togene duplication 
rather than sharing a common regulatory binding site in the 
3' UTR because the proteins encoded by these genes also 
exhibit a high degree (77-97%) of identity 

Biological Function of CFs 

At present, the only long CF that has been studied experi- 
mentally is that of the U1A gene. An approximately 53-nt 
long CF, called the PIE, is conserved among mammalian 
U1A genes (alignment in Supplement I, Supplementary 
Material online) and binds two molecules of the U1 A protein 
upstream of PIE is a shorter (11 nt) conserved 5'ss-like 
sequence that was shown to bind the U1 snRNP splicing fac- 
tor (Guan et al. 2007). The collective action of the PIE and 
5'ss-like sequence is to repress the PAS as part of a negative 
autoregulatory feedback system. Thus, U 1 A is an example of 
a CF composed of smaller individual binding sites. 

Chromatin Structure of CFs 

In spite of the U1A example above, it seems very unlikely 
that most of these long CFs are target sites of RNA binding 
proteins as known sites are usually <15 nt, Hence, we 
speculate that they may serve a role in chromatin modeling. 
One approach to explore this aspect is through DNasel HS 
mapping, which is an accurate method to identify genomic 
regulatory regions such as promoters, enhancers, and 
silencers upstream of transcription start sites. Two previous 
studies were conducted to perform genome wide mapping 
of DNasel HS sites in an array of human tissues (Sabo et al. 
2004; Boyle etal. 2008). However, current HS mapping stud- 
ies mostly emphasized transcription regulatory elements. 
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Fig. 5. — Distance of human CFs (based on 3' end of CF) from the PAS. (A) Distance of CP from PAS in the HMC group, (B) length of 
CP < 20 nt from the PAS in the HMC group, (0 distance of CP from the PAS in the HMCP group, and (D) length of CP < 20 nt from the PAS in the 
HMCP group. 



Only one recent bioinformatic study has reported a depletion 
of nucleosomes near PASs (Spies et al. 2009) where it was 
suggested that such depletion is unlikely to be related to 
expression even if its function is largely unknown. Hence, un- 
derstandingofthechromatinstructureatthe3' endisstillvery 
limited. To examine this issue more carefully, we conducted 
a comprehensive HS mapping along the PAS flanking regions 
in human so as to elucidate the possible impact of CFs to the 
accessibility of chromatin near the PASs. 

Data from two independent DNasel HS data sets were 
mapped to 16,730 human PAS flanking regions [-500, 
+50] in four tissues. We chose these two data sets because 
they were performed independently at two different insti- 
tutions, used different protocols, and analyzed four cell 
types, HeLa S3, K562, Hu ESC, and GM12878 each with 
distinct features, three being transformed cells and one 



being a primary human embryonic stem cell. Furthermore, 
all four cell types are from distinct tissues and ESCs plurip- 
otent cells that are not yet committed to a differentiation 
pathway. These PAS flanking regions were further split into 
three groups namely no CF, short CF (<200 nt), and long 
CF (>200 nt). We also examined other threshold lengths, 
such as 1 00 nt, but no significant differences were seen in 
the outcome analysis. Results in figure 6A and B show that 
the presence of CFs makes the region less accommodative 
to DNasel endonucleoytic cleavage (red bar vs. blue/ 
purple bars), and the differences were statistically signifi- 
cant according to pairwise Mest (P values were in the range 
of -10"^^ to 10"^^). Furthermore, 7 (except K562(1), 
K562(2), and HeLa S3(1) from the University of Washing- 
ton's data set) of 1 1 samples exhibit size dependent chro- 
matin accessibility, indicating that PAS flanking regions 
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Fig. 6. — Comparison of chromatin structure of liuman PAS flanking regions [-500, +50] with and without CFs. Vertical axis represents the 
average number of DNasel sites per sequence, "no CF" (red), "short CF" (blue), and "long CF" (purple) denote PAS sequence without CF, with CF < 200 
nt, and CF > 200 nt, respectively. {A) DNasel HS data obtained from Duke/University of Utah/University of North Carolina. {B) Similar DNasel HS data set 
obtained from University of Washington. The bracketed number after the tissue labels GM12878, HeLa S3, and K562 represent replicate. 



with shorter CFs were cut more frequently by DNasel than 
those regions containing longer CFs (blue bar vs. purple bar 
in fig. 6A and B). 



Discussion 

Asymmetrical Selection Pressure Flanking PAS 

Results show that close species comparison is capable of 
revealing the radiation of asymmetrical selection pressure 
from the poly(A) signal. Such a finding reveals that the 
upstream region involved in polyadenylation is longer than 
reported previously (Legendre and Gautheret 2003; Hu et al. 
2005; Tian et al. 2005). Even though the requirement of the 
upstream poly(A) signal and downstream U/GU-rich region 
are well established, the asymmetrical selection pressure 
present in up to 200 nt upstream of the PAS suggests the 
existence of other unknown c/s-elements in the upstream 
region that may involve signaling the arrival of PAS to the 
transcription complex. 

Unlike 5'ss sequences, a sharp fall in the mismatch ratio is 
not observed in the upstream region (supplementary figs. E9 
and E10, Supplementary Material online). Three possible 
explanations may account for the lack of a sharp fall. First, 
the upstream binding factor(s) is flexible in acting at 
a distance. Such action-at-a-distance is common for RNA- 
based regulation and often derives from secondary and ter- 
tiary folding patterns of the RNA itself. Second, the selection 
pressure for the region [-200, - 1 00] is gene specific rather 
than basal and thus can only be seen when comparing 
orthologous genes as done here. Third, unlike frameshift 



mutations caused by mis-splicing, no severe drawback 
would be expected if cleavage occurs at a slightly (±5 nt) 
different position. According to previous studies (Legendre 
and Gautheret 2003; Hu et al. 2005; Tian et al. 2005), one 
characteristic of the upstream region is the gradual elevation 
of uracil composition in the 5' to 3' direction in the region 
[-100, -30]. The maximum increment is about 5% which 
happens immediately 5' of the poly(A) signal. One study 
asserted that a stronger PAS possesses higher uracil content 
than a weaker one (Hu et al. 2005). However, we have 
found that the entire set of human and mouse 3' UTRs, 
except the region 50 nt immediately after the stop 
codon and the last 100 nt at the 3' end, is evenly enriched 
with uracil (—29%) and adenine (—27%) (Supplement J, 
Supplementary Material online). A similar observation has 
also been reported in diverse species (Graber et al. 1999). 
If the polyadenylation machinery solely relies on a uracil-rich 
signal, false signals in the 3' UTR should appear more 
frequent than the real one. Even taking the two canonical 
poly(A) signals into account to enhance specificity, such an 
idea helps little to improve the recognition of PAS as poly(A) 
signals occur ubiquitously. Close to 3.4 and 2.2 million 
canonical poly(A) signals were found in human and mouse 
introns, respectively. Examination of the region [-500, +500] 
in those intronic sequences show they contain 30% A and T, 
which is similar to the 3' UTR in terms of nucleotide 
composition. Hence, it is likely that additional gene-specific 
c/s-elements are preserved by nature near the PAS. 

Besides, it is intriguing to notice that even in the absence 
of good alignment in the low scoring plots (blue) as shown 
in figure 3, these plots still exhibit an asymmetrical pattern 
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between the upstream and downstream regions around the 
PAS, suggesting the possibility of degenerate and/or short 
sequence elements in the upstream region. 

Widespread Prevalence of CFs 

Assuming 2,500 human genes means that 5-12% (1,130 in 
HMCP and 2,987 in HMC) of all mammalian genes carry 
a CF near the PAS. Such a large proportion can hardly be 
accounted for by chance only, especially as we had elimi- 
nated overlapping genes as discussed previously in the 
Results. As shown in figure 4, large numbers of the CFs 
are longer than the well-studied AU-rich, U-rich, G-rich, 
and C-rich regions, which regulate mRNA stability within 
their target proteins suggesting these CFs utilize a novel 
mode of regulation. 

The approach discussed here complements previous 
work to search for overrepresented short and fixed-length 
c/s-elements of polyadenylation (Graber et al. 1999; Hu 
et al. 2005; Hutchins et al. 2008). Previous work may be pre- 
disposed with the model that these c/s-elements are binding 
targets of one or two factors. But the long CFs reported here 
are likely to play a role other than RNA protein recognition 
sites as they are much longer than known binding sites. 
Previous work identified long (>50 nt) and at least 70% 
conserved sequences in the noncoding regions among 
metazoans (Duret et al. 1993; Duret and Bucher 1997), 
and these sequences can be retrieved from the ACUTS 
database. But no analysis has been done to indicate their 
location bias near to the PAS as what we have shown here. 
A recent study has shown nucleosome depletion at around 
the [-100, +100] region (Spies et al. 2009). Double- 
stranded homopolymeric stretches of deoxyadenosine 
(10-20 nt) (Segal and Widom 2009), poly(A) signal and 
T-rich content are suggested for the diminishing of nucleo- 
somes for both high and low usage PAS. Another important 
insight comes from the study of ultraconserved elements 
(UCEs). By comparing human, mouse, and rat genomes, 
481 identical genomic segments longer than 200 nt were 
found, and they are also highly conserved in chicken 
and dog (Bejerano et al. 2004). Some of them function 
as long-range enhancers (Pennacchio et al. 2006), driving 
development (Woolfe et al. 2005), regulating splicing 
(Lareau et al. 2007; Ni et al. 2007), and epigenetic modifi- 
cation (Bernstein et al. 2006; Lee et al. 2006). At present, 
only one report demonstrated that the deletion of a subset 
of UCEs, postulated to be enhancers, could yield viable mice 
(Ahituv et al. 2007). Even though the CFs discovered here 
cannot be considered as ultraconserved, their conservation 
among distant mammalian species is so high and long that it 
is perplexing if they happen by pure chance during the 
course of evolution. Readers interested in the widespread 
of conservation among mammals may check the UCSC con- 
servation track (Karolchik et al. 2003). 



Possible Functions of CFs 

What may be the possible roles of these CFs? It is well 
established that the presence of a highly conserved poly(A) 
signal at -20 nt upstream and a U/GU-rich region at ~1 5 nt 
downstream from the PAS is sufficient to cause the polya- 
denylation machinery to cleave the nascent pre-mRNA from 
the transcription complex. Two efficiency elements located 
upstream of the PAS have been reported to promote 
polyadenylation. One was found in PAPOLA and PAPOLG 
genes (Venkataraman et al. 2005) with sequence consensus 
UGUAN. The other was A-rich sequence in the intronless 
MC4R gene (Nunes et al. 201 0). Many of the CFs reported 
here are located less than 20 nt from the PAS (fig. 5) and 
they lack significant sequence similarity except for the three 
probably duplicated genes. These observations indicate that 
most genes with CFs do not regulate by common factor. 

Half of the CFs were found closer than 20 nt upstream of 
the PAS, suggesting that they may be correlated to polya- 
denylation activity, otherwise there is no reason to support 
their biased proximity to the PAS. However, even with such 
positional preference, one cannot exclude the possibility 
that these CFs are required by other biological processes, 
such as mRNA stability and translation regulation as one 
in vitro study has shown these functionalities in 7 of the 
10 selected HCRs (Spicher et al. 1998). We speculate that 
these CFs may serve as gene-specific promoter elements, 
as PAS and promoters are known to influence each other. 
Even though CFs longer than 1 00 nt are unusual, one should 
not overlook the rest of the 30-1 00 nt long CFs as multiple 
RNA protein recognition sites could comprise a CF as in the 
case of the U1A gene's CF. 

CFs and Chromatin Structure 

Besides serving as protein binding targets, we have also 
investigated the chromatin modeling role of these CFs. 
According to our study, CFs correlate with a more compact 
chromatin structure though we are unsure about their 
impact on expression, regulation, and efficiency of polyade- 
nylation. Moreover, we have considered the methylation 
aspect of the PAS flanking region especially for the trimethy- 
lation of histone H3 Lysine 36 (H3K36me3) as it has been 
reported to be relevant to transcription termination (Lian 
et al. 2008). In addition, one similar study (Kolasinska-Zwierz 
et al. 2009) has shown that H3K36me3 chromatin marks 
are preferentially found in exonic rather than intronic 
sequences in Caenorhabitis elegans and such a methylation 
pattern is found to be conserved in human and mouse, 
indicating that H3K36me3 is of biological importance, likely, 
for mediating splicing. There are a number of established 
examples of the interactions of splicing factors with the pol- 
yadenylation complex (Lutz et al. 1996; Gunderson et al. 
1998; Shi et al. 2009). However, the current embargo-free 
genome wide H3K36me3 data are still inadequate to 



664 Genome Biol. Evol. 3:654-666. doi:10.1093/gbe/evr053 Advance Access publication June 24, 201 1 



Long Conserved Fragments Upstream of Mammalian Polyadenylation Sites 



GBE 



reconstruct a consistent picture about CFs, nnethylation, and 
3' end processing. Thus, nnore data of this kind and CP nnu- 
tagenesis studies are needed in thefuture in orderto elucidate 
the interplay between chromatin structure and polyadenyla- 
tion. 

Supplementary Material 

Supplementary figures B, C.1, C.2, E1-E12, G1-G4, 1, and J 
are available at Genome Biology and Evolution online 
(http://gbe.oxfordjournals.org/). 
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